The rapid advancement of large language models (LLMs), such as GPT-3 and its successors like GPT-4, has revolutionized the field of natural language processing (NLP). These models can generate human-like text with unprecedented fluency, coherence, and contextual awareness, making them invaluable tools for a wide range of applications, including content creation, language translation, and automated customer service. However, this technological breakthrough also brings forth significant concerns regarding the potential misuse of AI-generated text.
As LLMs become more accessible and their use more widespread, the ability to distinguish between human-written and machine-generated text has become an urgent need. The proliferation of AI-generated content poses risks such as misinformation, plagiarism, and the erosion of trust in digital communications. Therefore, developing robust techniques for accurately detecting AI-generated text is crucial to mitigate these risks and ensure the responsible use of LLMs.
The "LLM Detect AI Generated Text" competition hosted by Kaggle addresses this pressing challenge. By providing a dataset comprising a mixture of human-written and LLM-generated text samples, the competition encourages participants to develop machine learning models that can accurately classify each text sample as either human-written or AI-generated. This endeavor not only helps in building safeguards against the misuse of AI technologies but also advances the field of NLP by pushing the boundaries of what these models can achieve.
The primary objective of this project is to develop a highly accurate and robust model capable of distinguishing between human-written text and text generated by large language models (LLMs) such as GPT-4. Achieving this involves leveraging cutting-edge machine learning techniques, with a particular focus on natural language processing (NLP) and deep learning methodologies, to analyze and discern the linguistic patterns and characteristics inherent in the text samples provided.
To enhance the robustness of our model, we will consider expanding the dataset with additional samples and perform thorough Exploratory Data Analysis (EDA). This process will help us gain deeper insights into the data, revealing underlying patterns and characteristics that can inform our modeling approach. The findings from the EDA will be meticulously analyzed and presented within this notebook.
The model must be adept at identifying subtle differences between human and AI-generated text, considering various dimensions such as coherence, fluency, contextual relevance, and stylistic nuances. Furthermore, it should be capable of handling the complexities and intricacies of natural language, including idiomatic expressions, metaphors, and cultural references, which are often challenging for LLMs to emulate accurately.
Another critical aspect of the model is its ability to generalize well to unseen data, ensuring its effectiveness in real-world applications involving LLM-generated text. This necessitates a careful examination of potential biases and limitations in the training data, alongside employing robust evaluation techniques to measure the model's performance across diverse and representative test sets.
As large language models (LLMs) continue to evolve and find applications across various industries and domains, the ability to detect AI-generated text is becoming increasingly vital for maintaining trust, transparency, and accountability. In academic and research settings, accurate detection of AI-generated text is essential for upholding the integrity of scholarly work, preventing plagiarism, and ensuring adherence to principles of academic honesty and ethical conduct.
In legal and financial sectors, where the authenticity and provenance of documents are paramount, the capability to identify AI-generated text can play a crucial role in preventing fraudulent activities, such as contract forgery, identity theft, and other forms of deception involving AI-generated content. This ability is critical for maintaining the integrity of legal and financial transactions.
Furthermore, the detection of AI-generated text has significant implications for content moderation and online safety. As LLMs grow more sophisticated, there is a heightened risk of their misuse in generating and disseminating misinformation, hate speech, and other harmful content on a large scale. Developing accurate detection models can help in identifying and curbing the spread of such content, thereby fostering a safer and more trustworthy online environment.
Beyond these specific applications, the development of effective models for detecting AI-generated text contributes to ongoing research in NLP and artificial intelligence. It enhances our understanding of language models, their capabilities, and their limitations, promoting more responsible and ethical use of these powerful technologies.
The dataset provided for this competition consists of a diverse collection of text samples, some written by humans and others generated by large language models (LLMs). Human-written texts cover a wide range of topics and styles, reflecting the natural variability and richness of human language. In contrast, LLM-generated texts are produced by advanced AI models, which, despite their sophistication, often exhibit subtle yet detectable differences in structure, coherence, and style compared to human-authored texts.
One of the primary challenges in this project is the inherent similarity between human-written and AI-generated text. As LLMs have improved, their output has become increasingly indistinguishable from human writing. This makes the task of classification particularly difficult, requiring models that can capture and analyze minute differences and patterns that might not be immediately apparent.
Additionally, the dataset may contain biases introduced by the LLMs during text generation, as well as biases present in the human-written texts. Identifying and mitigating these biases is crucial for developing fair and unbiased models. Another challenge is the potential overfitting of models to specific patterns in the training data, which can limit their ability to generalize to new, unseen data.
Technical challenges also include handling the nuances and complexities of natural language. This involves accounting for idiomatic expressions, metaphors, cultural references, and varying linguistic styles, all of which can impact the performance of the detection model. Moreover, the computational requirements for training sophisticated NLP models, particularly those based on deep learning, can be substantial, necessitating efficient use of resources and optimization techniques.
By addressing these challenges, this project aims to develop a model that not only achieves high accuracy in distinguishing between human-written and AI-generated text but also provides insights into the broader implications of AI-generated content and its detection.
import os
os.environ["KERAS_BACKEND"] = "jax" # "jax" or "tensorflow" or "torch"
# Install necessary packages (you may want to run this separately)
#!pip install torch tensorflow numpy pandas matplotlib seaborn nltk wordcloud scikit-learn pyarrow
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
# Torch import
import torch
# Standard data science imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
# NLTK imports
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.util import ngrams
# Download necessary NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('vader_lexicon', quiet=True)
# Other imports
import string
import re
from collections import Counter
from wordcloud import WordCloud
# Scikit-learn imports
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
confusion_matrix, classification_report, roc_auc_score, roc_curve
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder
# Colormap (mpl.cm.get_cmap is deprecated; use the colormap registry instead)
cmap = mpl.colormaps['coolwarm']
from google.colab import drive
drive.mount('/content/drive')
# Define the path to save the files in Google Drive
drive_path = '/content/drive/My Drive/AI Generated Text/'
Mounted at /content/drive
Load data
kaggle_path = '/content/drive/My Drive/AI Generated Text/kaggle Datasets/'
train_essays = pd.read_csv(kaggle_path + 'train_essays.csv')
train_prompts = pd.read_csv(kaggle_path + 'train_prompts.csv')
test_essays = pd.read_csv(kaggle_path + 'test_essays.csv')
sample_submission = pd.read_csv(kaggle_path + 'sample_submission.csv')
print('train_essays Info:' + '\n' + '-' * 30)
print(train_essays.info())
print('\ntrain_prompts Info:' + '\n' + '-' * 30)
print(train_prompts.info())
print('\ntest_essays Info:' + '\n' + '-' * 30)
print(test_essays.info())
print('\nsample_submission Info:' + '\n' + '-' * 30)
print(sample_submission.info())
train_essays Info:
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1378 entries, 0 to 1377
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         1378 non-null   object
 1   prompt_id  1378 non-null   int64 
 2   text       1378 non-null   object
 3   generated  1378 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 43.2+ KB
None

train_prompts Info:
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   prompt_id     2 non-null      int64 
 1   prompt_name   2 non-null      object
 2   instructions  2 non-null      object
 3   source_text   2 non-null      object
dtypes: int64(1), object(3)
memory usage: 192.0+ bytes
None

test_essays Info:
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         3 non-null      object
 1   prompt_id  3 non-null      int64 
 2   text       3 non-null      object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
None

sample_submission Info:
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         3 non-null      object 
 1   generated  3 non-null      float64
dtypes: float64(1), object(1)
memory usage: 176.0+ bytes
None
In this section, we outline the key components of our project, including the original training data from Kaggle, the external datasets we will use to enhance our model, the preprocessing steps to prepare the data, and the exploratory data analysis (EDA) to understand the data better.
# Display the first few rows of each dataset to understand their structure
print("Train Essays:\n", train_essays.head(), "\n")
# Summary statistics for train essays
print("Train Essays Statistics:\n", train_essays.describe(), "\n")
# Check for missing values in train essays
print("Train Essays Missing Values:\n", train_essays.isnull().sum())
Train Essays:
id prompt_id text \
0 0059830c 0 Cars. Cars have been around since they became ...
1 005db917 0 Transportation is a large necessity in most co...
2 008f63e3 0 "America's love affair with it's vehicles seem...
3 00940276 0 How often do you ride in a car? Do you drive a...
4 00c39458 0 Cars are a wonderful thing. They are perhaps o...
generated essay_length
0 0 584
1 0 462
2 0 744
3 0 686
4 0 871
Train Essays Statistics:
prompt_id generated essay_length
count 1378.000000 1378.000000 1378.000000
mean 0.486212 0.002177 556.124093
std 0.499991 0.046625 159.920005
min 0.000000 0.000000 235.000000
25% 0.000000 0.000000 447.000000
50% 0.000000 0.000000 525.000000
75% 1.000000 0.000000 639.000000
max 1.000000 1.000000 1360.000000
Train Essays Missing Values:
id 0
prompt_id 0
text 0
generated 0
essay_length 0
dtype: int64
# Visualize the distribution of essay lengths in the train dataset
train_essays['essay_length'] = train_essays['text'].apply(lambda x: len(str(x).split()))
sns.histplot(train_essays['essay_length'], bins=50, kde=True)
plt.title('Distribution of Essay Lengths in Train Data')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.show()
# Explore the relationship between prompts and essays
plt.figure(figsize=(8, 6))
sns.countplot(x='prompt_id', data=train_essays)
plt.title('Distribution of Essays by Prompt ID')
plt.xlabel('Prompt ID')
plt.ylabel('Number of Essays')
plt.show()
The training data shows a variety of essay lengths but is limited to only two prompts, whereas the hidden test data draws on seven prompts. This highlights the need for a more diverse training set to ensure the model generalizes.
# Visualize the number of generated and non-generated texts
plt.figure(figsize=(8, 6))
sns.countplot(x='generated', data=train_essays)
plt.title('Distribution of Generated vs. Non-Generated Texts')
plt.xlabel('Generated (0 = Human, 1 = AI)')
plt.ylabel('Number of Essays')
plt.show()
The dataset provided by Kaggle consists predominantly of human-written text. To achieve our goal of distinguishing between human-written and AI-generated text, we need a dataset that includes both. After reviewing the Kaggle competition page, we found that others had identified this issue as well: many users have created and shared datasets containing text generated by various AI models.
We will integrate the original Kaggle data with these external datasets to create a more comprehensive dataset suitable for training our model.
external_path = '/content/drive/My Drive/AI Generated Text/External Datasets/'
argugpt = pd.read_csv(external_path + 'argugpt.csv')
train_drcat_04 = pd.read_csv(external_path + 'train_drcat_04.csv')
train_drcat_03 = pd.read_csv(external_path + 'train_drcat_03.csv')
train_drcat_02 = pd.read_csv(external_path + 'train_drcat_02.csv')
train_drcat_01 = pd.read_csv(external_path + 'train_drcat_01.csv')
print('argugpt Info:' + '\n' + '-' * 30)
print(argugpt.info())
print('\ntrain_drcat_04 Info:' + '\n' + '-' * 30)
print(train_drcat_04.info())
print('\ntrain_drcat_03 Info:' + '\n' + '-' * 30)
print(train_drcat_03.info())
print('\ntrain_drcat_02 Info:' + '\n' + '-' * 30)
print(train_drcat_02.info())
print('\ntrain_drcat_01 Info:' + '\n' + '-' * 30)
print(train_drcat_01.info())
argugpt Info:
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4038 entries, 0 to 4037
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           4038 non-null   object 
 1   prompt_id    4038 non-null   object 
 2   prompt       4038 non-null   object 
 3   text         4038 non-null   object 
 4   model        4038 non-null   object 
 5   temperature  4038 non-null   float64
 6   exam_type    4038 non-null   object 
 7   score        4038 non-null   int64  
 8   score_level  4038 non-null   object 
dtypes: float64(1), int64(1), object(7)
memory usage: 284.0+ KB
None

train_drcat_04 Info:
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44206 entries, 0 to 44205
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   essay_id  44206 non-null  object
 1   text      44206 non-null  object
 2   label     44206 non-null  int64 
 3   source    44206 non-null  object
 4   prompt    12911 non-null  object
 5   fold      44206 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 2.0+ MB
None

train_drcat_03 Info:
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42206 entries, 0 to 42205
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   essay_id  42206 non-null  object
 1   text      42206 non-null  object
 2   label     42206 non-null  int64 
 3   source    42206 non-null  object
 4   prompt    11911 non-null  object
 5   fold      42206 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 1.9+ MB
None

train_drcat_02 Info:
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39785 entries, 0 to 39784
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   essay_id  39785 non-null  object
 1   text      39785 non-null  object
 2   label     39785 non-null  int64 
 3   source    39785 non-null  object
 4   prompt    9490 non-null   object
 5   fold      39785 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 1.8+ MB
None

train_drcat_01 Info:
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33259 entries, 0 to 33258
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    33259 non-null  object
 1   label   33259 non-null  int64 
 2   source  33259 non-null  object
 3   fold    33259 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 1.0+ MB
None
Combine all train_drcat files into a single DataFrame
# Combine all train_drcat files into a single DataFrame
train_drcat_combined = pd.concat([train_drcat_01, train_drcat_02, train_drcat_03, train_drcat_04], ignore_index=True)
train_drcat_combined.rename(columns = {"label":"generated"}, inplace=True)
train_drcat_combined.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159456 entries, 0 to 159455
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   text       159456 non-null  object
 1   generated  159456 non-null  int64 
 2   source     159456 non-null  object
 3   fold       159456 non-null  int64 
 4   essay_id   126197 non-null  object
 5   prompt     34312 non-null   object
dtypes: int64(2), object(4)
memory usage: 7.3+ MB
# Visualize the number of generated and non-generated texts
plt.figure(figsize=(8, 6))
sns.countplot(x='generated', data=train_drcat_combined)
plt.title('Distribution of Generated vs. Non-Generated Texts')
plt.xlabel('Generated (0 = Human, 1 = AI)')
plt.ylabel('Number of Essays')
plt.show()
The argugpt dataset contains generated text without any labels. To integrate it with other external datasets, we will add a 'generated' column with a value of 1 for each row, indicating that the text is AI-generated.
argugpt['generated'] = 1
argugpt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4038 entries, 0 to 4037
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           4038 non-null   object 
 1   prompt_id    4038 non-null   object 
 2   prompt       4038 non-null   object 
 3   text         4038 non-null   object 
 4   model        4038 non-null   object 
 5   temperature  4038 non-null   float64
 6   exam_type    4038 non-null   object 
 7   score        4038 non-null   int64  
 8   score_level  4038 non-null   object 
 9   generated    4038 non-null   int64  
dtypes: float64(1), int64(2), object(7)
memory usage: 315.6+ KB
# Visualize the number of generated and non-generated texts
plt.figure(figsize=(8, 6))
sns.countplot(x='generated', data=argugpt)
plt.title('Distribution of Generated vs. Non-Generated Texts')
plt.xlabel('Generated (0 = Human, 1 = AI)')
plt.ylabel('Number of Essays')
plt.show()
data = pd.concat([train_drcat_combined[["text", "generated"]], argugpt[["text", "generated"]], train_essays[["text", "generated"]]])
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 164872 entries, 0 to 1377
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   text       164872 non-null  object
 1   generated  164872 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.8+ MB
data.to_parquet(drive_path + 'multiple_essays.parquet', index=False)
We saved the dataset to facilitate working with it more efficiently. By saving the cleaned and processed dataset, we can directly perform EDA and build models without the need to reconstruct the dataset each time, thus streamlining the workflow and saving time.
data = pd.read_parquet(drive_path + 'multiple_essays.parquet')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164872 entries, 0 to 164871
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   text       164872 non-null  object
 1   generated  164872 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.5+ MB
# Visualize the number of generated and non-generated texts
plt.figure(figsize=(8, 6))
sns.countplot(x='generated', data=data)
plt.title('Distribution of Generated vs. Non-Generated Texts')
plt.xlabel('Generated (0 = Human, 1 = AI)')
plt.ylabel('Number of Essays')
plt.show()
# Build the stop-word set once rather than on every call
stop_words = set(stopwords.words('english'))

# Function to clean text
def clean_text(text):
    # Remove punctuation and numbers (keep letters and whitespace only)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove stop words
    words = text.split()
    text = ' '.join([word for word in words if word not in stop_words])
    return text
# Apply text cleaning
data['cleaned_text'] = data['text'].apply(clean_text)
print("Cleaned Text Data:\n", data['cleaned_text'])
Cleaned Text Data:
0 alot reasons keep despised method choosing pre...
1 driving smart cars drive themself pros cons pe...
2 dear principal believe students school allowed...
3 dear principal community service required stud...
4 argument development driverless cars cars allo...
...
164867 fuss elector college many people get confused ...
164868 limiting car usage many advantages putting lot...
164869 theres new trend developing years soon full th...
164870 know cars big part society today however cars ...
164871 cars around since popular ever since although ...
Name: cleaned_text, Length: 164872, dtype: object
# Function to find the words removed during cleaning
# Note: because cleaning also strips punctuation and numbers, the returned set
# contains stop words as well as punctuation-bearing tokens (e.g. "to," or "grades.")
def find_removed_stop_words(original_text, cleaned_text):
    original_words = set(original_text.lower().split())
    cleaned_words = set(cleaned_text.split())
    removed_words = original_words - cleaned_words
    return removed_words
# Apply the function to every row of the dataset
data['removed_stop_words'] = data.apply(lambda row: find_removed_stop_words(row['text'], row['cleaned_text']), axis=1)
# Display the original text, cleaned text, and removed stop words for the first few rows
data[['text', 'cleaned_text', 'removed_stop_words']].head()
|   | text | cleaned_text | removed_stop_words |
|---|---|---|---|
| 0 | There are alot reasons to keep our the despise... | alot reasons keep despised method choosing pre... | {his, to, "it, a, '68, state,, romney..., "ric... |
| 1 | Driving smart cars that drive by themself has ... | driving smart cars drive themself pros cons pe... | {now, to,, fuel,, wrecks., to, to?, drivers., ... |
| 2 | Dear Principal,\n\nI believe that students at ... | dear principal believe students school allowed... | {has, which, them, at, noise,, with, grades., ... |
| 3 | Dear Principal,\n\nCommunity service should no... | dear principal community service required stud... | {own, to, can, a, they, their, forced., are, r... |
| 4 | My argument for the development of the driverl... | argument development driverless cars cars allo... | {lives., world., environment., to, can, a, the... |
Tokenization is the process of breaking down a piece of text into smaller units called tokens. These tokens can be words, phrases, or even characters. Tokenization is a fundamental step in natural language processing (NLP) and text analysis because it converts unstructured text into a structured format that can be more easily analyzed and processed by machine learning algorithms.
By breaking text into tokens, it becomes easier to apply other preprocessing steps such as lowercasing, removing punctuation, and filtering out stop words.
To better understand the data, we will examine both the cleaned text and the original text.
# Tokenize the cleaned texts
data['cleaned_tokens'] = data['cleaned_text'].apply(nltk.word_tokenize)
print("Tokenized Text Data:\n", data[['cleaned_text', 'cleaned_tokens']])
# Tokenize the original texts
data['tokens'] = data['text'].apply(nltk.word_tokenize)
print("Tokenized Text Data:\n", data[['text', 'tokens']])
Tokenized Text Data:
cleaned_text \
0 alot reasons keep despised method choosing pre...
1 driving smart cars drive themself pros cons pe...
2 dear principal believe students school allowed...
3 dear principal community service required stud...
4 argument development driverless cars cars allo...
... ...
164867 fuss elector college many people get confused ...
164868 limiting car usage many advantages putting lot...
164869 theres new trend developing years soon full th...
164870 know cars big part society today however cars ...
164871 cars around since popular ever since although ...
cleaned_tokens
0 [alot, reasons, keep, despised, method, choosi...
1 [driving, smart, cars, drive, themself, pros, ...
2 [dear, principal, believe, students, school, a...
3 [dear, principal, community, service, required...
4 [argument, development, driverless, cars, cars...
... ...
164867 [fuss, elector, college, many, people, get, co...
164868 [limiting, car, usage, many, advantages, putti...
164869 [theres, new, trend, developing, years, soon, ...
164870 [know, cars, big, part, society, today, howeve...
164871 [cars, around, since, popular, ever, since, al...
[164872 rows x 2 columns]
Tokenized Text Data:
text \
0 There are alot reasons to keep our the despise...
1 Driving smart cars that drive by themself has ...
2 Dear Principal,\n\nI believe that students at ...
3 Dear Principal,\n\nCommunity service should no...
4 My argument for the development of the driverl...
... ...
164867 There has been a fuss about the Elector Colleg...
164868 Limiting car usage has many advantages. Such a...
164869 There's a new trend that has been developing f...
164870 As we all know cars are a big part of our soci...
164871 Cars have been around since the 1800's and hav...
tokens
0 [There, are, alot, reasons, to, keep, our, the...
1 [Driving, smart, cars, that, drive, by, themse...
2 [Dear, Principal, ,, I, believe, that, student...
3 [Dear, Principal, ,, Community, service, shoul...
4 [My, argument, for, the, development, of, the,...
... ...
164867 [There, has, been, a, fuss, about, the, Electo...
164868 [Limiting, car, usage, has, many, advantages, ...
164869 [There, 's, a, new, trend, that, has, been, de...
164870 [As, we, all, know, cars, are, a, big, part, o...
164871 [Cars, have, been, around, since, the, 1800, '...
[164872 rows x 2 columns]
data.to_parquet(drive_path + 'multiple_essays&cleaned_text&tokens.parquet', index=False)
Load data
data = pd.read_parquet(drive_path + 'multiple_essays&cleaned_text&tokens.parquet')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164872 entries, 0 to 164871
Data columns (total 5 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   text            164872 non-null  object
 1   generated       164872 non-null  int64 
 2   cleaned_text    164872 non-null  object
 3   cleaned_tokens  164872 non-null  object
 4   tokens          164872 non-null  object
dtypes: int64(1), object(4)
memory usage: 6.3+ MB
Tokenization allows us to calculate the frequency of a word, which is useful for text classification and sentiment analysis.
The most common words in the original text include many stop words such as "the", "to", "and", "a", "of", etc., which are frequent but not informative for text analysis tasks. In contrast, the cleaned text shows more meaningful words that carry actual information, as the stop words and punctuation have been removed.
By comparing the most frequent words in the original and cleaned text, we can clearly see the impact of stop words and why text cleaning is essential. This approach highlights how removing stop words and other noise leads to more meaningful and relevant features for subsequent analysis and modeling.
What Will We Gain From These Insights? Understanding the most common words in both the original and cleaned text data can provide valuable insight into the themes and topics prevalent in the dataset. This analysis helps identify frequent terms and understand the overall content structure, which is crucial for further text processing and modeling tasks.

Which Technologies (Frameworks, etc.) Did We Use? For this analysis, we utilized the following:

- Pandas: for data manipulation and handling the DataFrame.
- Matplotlib and Seaborn: for creating the visualizations of the most common words.
- Collections (Counter): for counting word frequencies efficiently.

Issues We Encountered Along the Way: Initially, we had trouble counting and visualizing the words correctly because the tokens were stored as strings. We resolved this by ensuring the tokens were properly exploded into individual words. Handling the large dataset and keeping the visualizations clear and informative also required careful plotting and formatting.
# Original Texts Tokens
all_tokens = [token for tokens in data['tokens'] for token in tokens] # Flatten the list of tokens
word_freq = Counter(all_tokens) # Count word frequencies
most_common_words = word_freq.most_common(50) # Get the 50 most common words
words, counts = zip(*most_common_words) # Separate words and their counts for plotting
# Cleaned Text Tokens
cleaned_all_tokens = [token for tokens in data['cleaned_tokens'] for token in tokens]
cleaned_word_freq = Counter(cleaned_all_tokens)
cleaned_most_common_words = cleaned_word_freq.most_common(50)
cleaned_words, cleaned_counts = zip(*cleaned_most_common_words)
fig, axes = plt.subplots(1, 2, figsize=(24, 6)) # Plot the most common words
# Plot the most common words (original)
axes[0].bar(words, counts)
axes[0].set_title('Top 50 Most Common Words (Original Data)')
axes[0].set_xlabel('Words')
axes[0].set_ylabel('Frequency')
axes[0].tick_params(axis='x', rotation=90)
# Plot the most common words (cleaned data)
axes[1].bar(cleaned_words, cleaned_counts)
axes[1].set_title('Top 50 Most Common Words (Cleaned Data)')
axes[1].set_xlabel('Words')
axes[1].set_ylabel('Frequency')
axes[1].tick_params(axis='x', rotation=90)
# Show the plots
plt.tight_layout()
plt.show()
As seen in the original data, there are numerous values that hold little significance for our analysis, such as stop words, punctuation, and irrelevant symbols. By cleaning the data, we remove these extraneous elements, allowing us to focus on the meaningful content and gain more accurate insights from our exploratory data analysis (EDA).
By choosing to analyze the most common words only on the cleaned data, we ensure that our analysis is focused on the most relevant and informative parts of the text.
# Count word frequencies by label
def get_most_common_words_by_label(data, label_column, token_column, num_words=30):
label_groups = data.groupby(label_column)
most_common_words_by_label = {}
for label, group in label_groups:
tokens = [token for tokens in group[token_column] for token in tokens]
word_freq = Counter(tokens)
most_common_words_by_label[label] = word_freq.most_common(num_words)
return most_common_words_by_label
# Get the most common words for cleaned text by label
most_common_cleaned_words_by_label = get_most_common_words_by_label(data, 'generated', 'cleaned_tokens')
# print("Most Common Words in Cleaned Text by Label:\n", most_common_cleaned_words_by_label)
# Combine the data into a single DataFrame
combined_data = []
for label, common_words in most_common_cleaned_words_by_label.items():
for word, count in common_words:
combined_data.append({'label': label, 'word': word, 'count': count})
df_combined = pd.DataFrame(combined_data)
#print(df_combined)
# Mapping labels
label_mapping = {0: 'Human', 1: 'Generated'}
df_combined['label'] = df_combined['label'].map(label_mapping)
# Prepare the data for plotting
plot_data = df_combined.pivot(index='word', columns='label', values='count').fillna(0)
plot_data = plot_data.reindex(plot_data.sum(axis=1).sort_values(ascending=False).index) # Sort by total count
all_common_words = plot_data.index
labels = plot_data.columns
# Plotting
fig, ax = plt.subplots(figsize=(15, 8))
word_positions = range(len(all_common_words))
colors = ['orange', 'skyblue']
# Plot stacked bars for each label
for i, label in enumerate(labels):
ax.bar(word_positions, plot_data[label],
color=colors[i], label=f'{label}',
bottom=plot_data[labels[:i]].sum(axis=1))
# Annotate each bar with the count value
for i, word in enumerate(all_common_words):
for j, label in enumerate(labels):
count = plot_data.loc[word, label]
if count > 0:
ax.text(i, plot_data[labels[:j]].sum(axis=1).iloc[i] + count / 2, int(count),
ha='center', va='center', rotation=90, fontsize=10, color='black')
# Customize the plot
ax.set_xticks(word_positions)
ax.set_xticklabels(all_common_words, rotation=90, ha='right')
ax.set_title('Most Common Words by Label (Cleaned Text) ')
ax.set_xlabel('Words')
ax.set_ylabel('Frequency')
ax.legend(title='Label')
plt.tight_layout()
plt.show()
# plot most common words by label
label_names = {0: 'Human', 1: 'Generated'}
def plot_most_common_words_by_label(most_common_words_by_label, title_suffix):
num_labels = len(most_common_words_by_label)
fig, axes = plt.subplots(num_labels, 1, figsize=(20, 5 * num_labels), squeeze=False)
for i, (label, common_words) in enumerate(most_common_words_by_label.items()):
# Cleaned text
cleaned_words, cleaned_counts = zip(*common_words)
ax = axes[i, 0]
color = 'skyblue' if label == 0 else 'orange'
bars = ax.bar(cleaned_words, cleaned_counts, color=color)
# Add annotations in the center of the bars
for bar in bars:
height = bar.get_height()
ax.annotate('{}'.format(height),
xy=(bar.get_x() + bar.get_width() / 2, height / 2),
ha='center', va='center', rotation=90, fontsize=10, color='black', weight='bold')
label_name = label_names.get(label, f'Label {label}')
ax.set_title(f'{label_name} - Top {len(cleaned_words)} Most Common Words (Cleaned Text)')
ax.set_xlabel('Words')
ax.set_ylabel('Frequency')
ax.tick_params(axis='x', rotation=90)
plt.tight_layout()
plt.show()
# Plot the most common words for cleaned text by label
plot_most_common_words_by_label(most_common_cleaned_words_by_label, 'Cleaned Text')
Creating a word cloud is an effective way to visualize the most frequent words in our text data. Similar to plotting the most common words, we generate the word cloud after cleaning and tokenizing our text data to ensure that the visualization is meaningful and free from noise.
from wordcloud import WordCloud
# Combine all cleaned text into a single string
all_cleaned_text = ' '.join(data['cleaned_text'])
# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_cleaned_text)
# Plot the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Cleaned Text')
plt.show()
Word Cloud by Each Label - Creating word clouds by each label allows us to visualize the most frequent words associated with each category in the dataset. This can provide insights into the distinguishing features of different classes.
labels = data['generated'].unique()
#num_labels = len(labels)
label_names = {0: 'Human-Written', 1: 'AI Generated'}
fig, axes = plt.subplots(1, 2, figsize=(20, 5), squeeze=False)
# Generate word clouds for each label
for i, label in enumerate(labels):
label_text = ' '.join(data[data['generated'] == label]['cleaned_text'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(label_text)
# Plot the word cloud in the corresponding subplot
label_name = label_names.get(label, f'Label {label}')
ax = axes[0, i]
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
ax.set_title(f'Word Cloud of {label_name} Text')
plt.tight_layout()
plt.show()
The graph above illustrates a distinct pattern in word usage between human-written and AI-generated texts. While both types of text may appear similar at first glance, this analysis reveals subtle but significant differences in vocabulary preferences.
In conclusion, while it remains challenging to definitively identify the source of a text based solely on word choice, this analysis provides valuable insights into the linguistic patterns of human and AI-generated content. It underscores the importance of careful examination and the potential for more sophisticated analysis techniques in distinguishing between human and AI-authored texts.
N-grams analysis aims to uncover patterns and co-occurrences of words that carry meaningful information. Therefore, it's generally best to use cleaned data for exploring N-grams.
What Will We Gain From These Insights? Analyzing n-grams (bigrams, trigrams, fourgrams) by label allows us to understand the common phrases and word sequences in the dataset. This can reveal syntactic patterns and common expressions in both human and AI-generated texts, providing deeper insight into linguistic structures and potential distinguishing features.

Which Technologies (Frameworks, etc.) Did We Use? For n-gram generation and visualization, we used:

- NLTK: for tokenizing text and generating n-grams.
- Collections (Counter): for counting n-gram frequencies.
- Matplotlib: for visualizing the most common n-grams.

Issues We Encountered Along the Way: Generating n-grams over such a large text corpus was computationally intensive and required efficient text processing. Accurate tokenization and n-gram generation involved resolving issues with punctuation and case sensitivity, and visualizing the n-grams in a clear, interpretable manner required careful plotting and formatting adjustments.
def generate_ngrams(text, n):
tokens = nltk.word_tokenize(text.lower())
return [' '.join(gram) for gram in ngrams(tokens, n)]
def plot_ngrams(ngram_dict, title, top_n=20):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
    for ax, (label, ngram_freq) in zip([ax1, ax2], ngram_dict.items()):
        ngram_common = ngram_freq.most_common(top_n)
        # Use a local name that does not shadow nltk.util's `ngrams` function
        grams, freqs = zip(*ngram_common)
        ax.barh(grams, freqs, color='skyblue' if label == 'Human' else 'orange')
        ax.set_xlabel('Frequency')
        ax.set_title(f'{title} - {label}')
        ax.invert_yaxis()
    plt.tight_layout()
    plt.show()
def generate_and_plot_ngrams(data, n, top_n=20):
ngram_dict = {'Human': Counter(), 'Generated': Counter()}
for label, text in zip(data['generated'], data['cleaned_text']):
label_key = 'Human' if label == 0 else 'Generated'
ngram_dict[label_key].update(generate_ngrams(text, n))
plot_ngrams(ngram_dict, f'Top {top_n} Most Common {n}-grams', top_n)
# Generate and plot bigrams, trigrams, and fourgrams
for n in [2, 3, 4]:
generate_and_plot_ngrams(data, n)
Our analysis of n-grams (bigrams, trigrams, and fourgrams) in human-written and AI-generated texts reveals increasingly distinct patterns as we delve deeper into the linguistic structures:
Increasing Differentiation: As we progress from single words to more complex n-grams, the differences between human and AI-generated text become more pronounced. This suggests that while individual word choices might be similar, the way these words are combined and structured differs significantly between human and AI authors.
Distinct Expressions: Humans and AI tend to use different phrases and expressions. This distinction is particularly evident in the trigrams and fourgrams, where we can see complete phrases or partial sentences that are characteristic of each source.
Stylistic Patterns: The n-gram analysis uncovers stylistic patterns that may not be immediately apparent in a casual reading. These patterns could reflect the underlying mechanisms of human language production versus AI text generation algorithms.
Potential for Detection: The clear differences in n-gram usage provide a promising avenue for developing more sophisticated AI text detection methods. By analyzing these multi-word structures, we may be able to identify AI-generated text with greater accuracy than by looking at individual words alone (a small quantitative sketch follows this list).
Contextual Usage: The differences in n-grams also hint at variations in contextual understanding and usage between humans and AI. Humans might use certain phrases in ways that reflect cultural knowledge or common experiences, while AI-generated text might show patterns based on its training data and generation algorithms.
Evolving Landscape: It's important to note that as AI language models continue to evolve, these differences may become more subtle. Ongoing analysis will be crucial to keep pace with advancements in AI text generation.
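One way to quantify this increasing differentiation is to check how much the top n-grams of the two classes overlap. The sketch below is illustrative rather than part of the original pipeline; it reuses the generate_ngrams helper and the cleaned data from above, and the top-100 cutoff is an arbitrary choice:
# Illustrative: Jaccard overlap of the top-k n-grams per class
# (lower overlap suggests more class-distinctive n-grams)
def top_ngram_overlap(data, n, top_k=100):
    counters = {'Human': Counter(), 'Generated': Counter()}
    for label, text in zip(data['generated'], data['cleaned_text']):
        counters['Human' if label == 0 else 'Generated'].update(generate_ngrams(text, n))
    top = {key: {g for g, _ in c.most_common(top_k)} for key, c in counters.items()}
    return len(top['Human'] & top['Generated']) / len(top['Human'] | top['Generated'])
for n in [1, 2, 3, 4]:
    print(f"{n}-gram top-100 Jaccard overlap: {top_ngram_overlap(data, n):.2f}")
If the overlap shrinks as n grows, that supports the observation that longer word sequences separate the two classes better than individual words.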
In conclusion, while surface-level similarities between human and AI-generated text can make distinction challenging, a deeper analysis of n-grams reveals fundamental differences in language patterns. This provides not only a method for potential identification of AI-generated text but also insights into the underlying processes of language production in both humans and AI systems.
Sentiment analysis is a technique used to determine the emotional tone behind a series of words, helping to understand the attitudes, opinions, and emotions expressed in text. In our text classification task, sentiment scores are crucial features that provide additional context to the text data.
What Will We Gain From These Insights? Sentiment analysis provides insight into the emotional tone and sentiment distribution within the text data. By understanding the sentiment scores and classifications, we can identify patterns of positive, negative, and neutral sentiment in both human and AI-generated texts, which can be crucial for tasks such as content moderation and sentiment-based modeling.

Which Technologies (Frameworks, etc.) Did We Use? For sentiment analysis, we used:

- NLTK (VADER SentimentIntensityAnalyzer): for computing sentiment scores.
- Pandas: for handling sentiment scores and classifications in the DataFrame.
- Seaborn and Matplotlib: for visualizing the sentiment distributions and counts.

Issues We Encountered Along the Way: Handling and visualizing the sentiment data required efficient computation and clear plotting. Ensuring the sentiment classifications were correctly interpreted and visualized involved careful data manipulation and formatting, and creating meaningful, interpretable visualizations required adjustments to plot settings and annotations.
Sentiment Polarity:
# Initialize VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()
def get_sentiment(text):
scores = sia.polarity_scores(text)
if scores['compound'] >= 0.05:
return 'positive'
elif scores['compound'] <= -0.05:
return 'negative'
else:
return 'neutral'
# Apply sentiment analysis
data['sentiment'] = data['text'].apply(get_sentiment)
print("Sentiment Data:\n", data[['text', 'sentiment']])
# Count sentiments by label
sentiment_counts = data.groupby(['generated', 'sentiment']).size().unstack(fill_value=0)
print("Sentiment Counts by Label:\n", sentiment_counts)
Sentiment Data:
text sentiment
0 There are alot reasons to keep our the despise... positive
1 Driving smart cars that drive by themself has ... negative
2 Dear Principal,\n\nI believe that students at ... negative
3 Dear Principal,\n\nCommunity service should no... negative
4 My argument for the development of the driverl... positive
... ... ...
164867 There has been a fuss about the Elector Colleg... positive
164868 Limiting car usage has many advantages. Such a... positive
164869 There's a new trend that has been developing f... positive
164870 As we all know cars are a big part of our soci... positive
164871 Cars have been around since the 1800's and hav... positive
[164872 rows x 2 columns]
Sentiment Counts by Label:
sentiment negative neutral positive
generated
0 21380 518 94849
1 2831 89 45205
vis_sentiment_counts = sentiment_counts.reset_index()
# Melt the DataFrame for seaborn
sentiment_counts_melted = vis_sentiment_counts.melt(id_vars='generated', value_vars=['positive', 'negative', 'neutral'],
var_name='sentiment', value_name='count')
# Plotting
plt.figure(figsize=(10, 6))
barplot = sns.barplot(data=sentiment_counts_melted, x='generated', y='count', hue='sentiment', palette='viridis')
# Annotate each bar with the count value
for p in barplot.patches:
if p.get_height() > 0:
barplot.annotate(format(p.get_height(), '.0f'),
(p.get_x() + p.get_width() / 2., p.get_height()),
ha = 'center', va = 'center',
xytext = (0, 9), textcoords = 'offset points')
plt.title('Sentiment Counts by Label')
plt.xlabel('label (0 = Human, 1 = AI)')
plt.ylabel('Count')
plt.show()
Sentiment analysis also provides numeric scores for positive, negative, neutral, and compound sentiment:
sia = SentimentIntensityAnalyzer()
# Function to calculate sentiment scores
def get_sentiment_scores(text):
return sia.polarity_scores(text)
# Apply sentiment analysis to the original text data
data['sentiment'] = data['text'].apply(get_sentiment_scores)
# Extract sentiment features
data['positive sentiment'] = data['sentiment'].apply(lambda x: x['pos'])
data['negative sentiment'] = data['sentiment'].apply(lambda x: x['neg'])
data['neutral sentiment'] = data['sentiment'].apply(lambda x: x['neu'])
data['compound sentiment'] = data['sentiment'].apply(lambda x: x['compound'])
Save Data
data.to_parquet(drive_path + 'multiple_essays&cleaned_text&tokens&sentiments.parquet', index=False)
# Group by labels and calculate average sentiment scores
label_sentiment = data.groupby('generated')[['positive sentiment', 'negative sentiment', 'neutral sentiment', 'compound sentiment']].mean().reset_index()
# Bar plot for average sentiment scores by label
# (DataFrame.plot creates its own figure, so a separate plt.figure call
#  would just leave an empty figure behind)
label_sentiment.plot(x='generated', kind='bar', stacked=True, figsize=(12, 6))
plt.title('Average Sentiment Scores by Label')
plt.xlabel('Generated (0 = Human, 1 = AI)')
plt.ylabel('Average Sentiment Score')
plt.legend(loc='upper right')
plt.show()
# Plot distribution graphs for each sentiment score
fig, axes = plt.subplots(2, 2, figsize=(15, 10), sharex=False)
label_names = {0: 'Human-Written', 1: 'AI Generated'}
sentiments = ['positive sentiment', 'negative sentiment', 'neutral sentiment', 'compound sentiment']
titles = ['Positive Sentiment Distribution', 'Negative Sentiment Distribution', 'Neutral Sentiment Distribution', 'Compound Sentiment Distribution']
for i, sentiment in enumerate(sentiments):
ax = axes[i//2, i%2]
for label in data['generated'].unique():
subset = data[data['generated'] == label]
label_name = label_names.get(label, f'Label {label}')
sns.histplot(subset[sentiment], kde=True, label=f'{label_name}', bins=20, ax=ax)
ax.set_title(titles[i])
ax.set_xlabel('Sentiment Score')
ax.set_ylabel('Frequency')
ax.legend()
plt.tight_layout()
plt.show()
Main conclusion: While exploratory data analysis (EDA) reveals certain differences between human and AI-generated texts, these distinctions alone are insufficient for reliable text source classification. Therefore, advanced modeling techniques are necessary to accurately identify the text origin.
Most Common Words and Word Cloud: By analyzing the most common words and generating word clouds, we observed specific vocabulary patterns in both human and AI-generated texts. AI-generated texts tend to use more formal or technical terms, while human texts display a richer variety of idiomatic expressions. However, these differences are not definitive, indicating that word frequency and visual representations, while informative, cannot solely determine the text source.
N-grams and Sentiment Analysis: The n-gram analysis revealed recurring phrases and linguistic structures unique to each text type, and sentiment analysis provided insights into the emotional tone of the texts. AI-generated texts showed more uniform and neutral sentiment, whereas human texts exhibited greater emotional diversity. Despite these insights, the overlap and subtle differences mean that n-grams and sentiment patterns alone are inadequate for reliable classification. Therefore, these EDA findings should be integrated into more sophisticated machine learning models to achieve accurate text source identification.
data = pd.read_parquet(drive_path + 'multiple_essays&cleaned_text&tokens&sentiments.parquet')
The TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer transforms text data into numerical features, making it suitable for machine learning algorithms. Term frequency (TF) measures how often a term appears in a document: the more frequent a term, the higher its TF score. Inverse document frequency (IDF) down-weights terms that appear in many documents, so common but uninformative words contribute less to the representation.

Applying the TF-IDF vectorizer to the cleaned text data creates a robust and efficient representation of the text, which improves the performance of machine learning models on text classification tasks.
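For reference, with scikit-learn's defaults (smooth_idf=True, norm='l2') the weight of term t in document d is tfidf(t, d) = tf(t, d) × (ln((1 + n) / (1 + df(t))) + 1), where n is the total number of documents and df(t) is the number of documents containing t; each document vector is then rescaled to unit L2 length.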
# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer(
max_features=3000, # You can adjust this based on your needs
min_df=2, # Remove terms that appear in less than 2 documents
max_df=0.85 # Remove terms that appear in more than 85% of documents
)
# Fit and transform the cleaned text data
tfidf_features = vectorizer.fit_transform(data['cleaned_text'])
# Convert the features to a DataFrame for better readability
tfidf_df = pd.DataFrame(tfidf_features.toarray(), columns=vectorizer.get_feature_names_out())
# Display the first few rows of the TF-IDF features
print(tfidf_df.head())
   abilities  ability      able  abolish  abolished  absolutely  abuse  \
0        0.0      0.0  0.026795      0.0        0.0    0.000000    0.0
1        0.0      0.0  0.000000      0.0        0.0    0.000000    0.0
2        0.0      0.0  0.000000      0.0        0.0    0.000000    0.0
3        0.0      0.0  0.029761      0.0        0.0    0.077367    0.0
4        0.0      0.0  0.028382      0.0        0.0    0.000000    0.0

   academic  academically  academics  ...  youll  young  younger     youre  \
0       0.0           0.0        0.0  ...    0.0    0.0      0.0  0.000000
1       0.0           0.0        0.0  ...    0.0    0.0      0.0  0.000000
2       0.0           0.0        0.0  ...    0.0    0.0      0.0  0.000000
3       0.0           0.0        0.0  ...    0.0    0.0      0.0  0.047755
4       0.0           0.0        0.0  ...    0.0    0.0      0.0  0.000000

   youth  youve  zip  zone  zones  zoos
0    0.0    0.0  0.0   0.0    0.0   0.0
1    0.0    0.0  0.0   0.0    0.0   0.0
2    0.0    0.0  0.0   0.0    0.0   0.0
3    0.0    0.0  0.0   0.0    0.0   0.0
4    0.0    0.0  0.0   0.0    0.0   0.0

[5 rows x 3000 columns]
tfidf_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164872 entries, 0 to 164871
Columns: 3000 entries, abilities to zoos
dtypes: float64(3000)
memory usage: 3.7 GB
Reducing the Number of Features with PCA:
The TF-IDF vectorizer creates 3,000 features, which can be computationally expensive and may lead to overfitting. To address this, we use Principal Component Analysis (PCA) to reduce the number of features to 100.
from sklearn.decomposition import PCA
# Apply PCA for dimensionality reduction
pca = PCA(n_components=100) # Reduce to 100 dimensions
reduced_tfidf_matrix = pca.fit_transform(tfidf_df)
# Generate column names for the 100 principal components
column_names = [f'Principal Component {i+1}' for i in range(100)]
# Convert the reduced matrix to DataFrame
reduced_tfidf_df = pd.DataFrame(reduced_tfidf_matrix, columns=column_names)
# Display the reduced TF-IDF DataFrame
print(reduced_tfidf_df)
Principal Component 1 Principal Component 2 Principal Component 3 \
0 0.480854 0.481324 -0.123376
1 0.191796 -0.369763 -0.229662
2 -0.184416 0.015185 -0.126379
3 -0.083348 0.044767 -0.013280
4 0.224137 -0.415361 -0.246997
... ... ... ...
164867 0.322541 0.305280 -0.070180
164868 0.138479 -0.297583 -0.198840
164869 0.196350 -0.314141 -0.187126
164870 0.248369 -0.438832 -0.268732
164871 0.213497 -0.364067 -0.230780
Principal Component 4 Principal Component 5 Principal Component 6 \
0 0.105633 0.043979 -0.002698
1 0.139634 -0.057858 0.049589
2 -0.006820 0.565616 -0.093322
3 -0.172115 -0.045479 0.317066
4 0.155747 -0.103111 0.048259
... ... ... ...
164867 0.045341 0.005261 0.003767
164868 0.110875 -0.076344 0.061336
164869 0.113935 -0.102107 0.033668
164870 0.178503 -0.082305 0.057043
164871 0.161508 -0.139866 0.059248
Principal Component 7 Principal Component 8 Principal Component 9 \
0 0.035242 0.000177 0.013380
1 0.023656 0.002799 -0.143271
2 0.012921 0.086002 0.078157
3 0.142402 0.230226 0.051246
4 0.018892 -0.022636 -0.109291
... ... ... ...
164867 0.011130 0.000025 -0.002996
164868 0.014728 -0.015575 -0.027731
164869 0.008286 -0.025159 0.040267
164870 0.035330 -0.021128 -0.052976
164871 0.040920 -0.014479 0.045265
Principal Component 10 ... Principal Component 91 \
0 -0.014916 ... -0.007867
1 0.164139 ... 0.006501
2 -0.049930 ... -0.025211
3 0.088214 ... 0.006827
4 0.134941 ... -0.003503
... ... ... ...
164867 -0.002328 ... 0.034847
164868 -0.099134 ... -0.023980
164869 -0.090375 ... 0.006275
164870 -0.020125 ... -0.045098
164871 -0.093905 ... 0.019036
Principal Component 92 Principal Component 93 \
0 -0.049826 -0.043934
1 0.051111 -0.014386
2 0.033311 0.016293
3 -0.034728 0.017011
4 0.037308 -0.000272
... ... ...
164867 -0.063548 -0.080737
164868 -0.022814 -0.032386
164869 0.000368 -0.029301
164870 0.050164 0.047650
164871 0.011162 -0.014474
Principal Component 94 Principal Component 95 \
0 0.069610 0.026564
1 0.018647 0.028698
2 0.043642 -0.037273
3 0.010588 0.026261
4 0.017225 0.002049
... ... ...
164867 0.025168 0.007811
164868 -0.018428 -0.042140
164869 -0.010469 -0.028149
164870 0.021030 0.061822
164871 0.046220 -0.038785
Principal Component 96 Principal Component 97 \
0 0.046597 0.044868
1 0.016389 0.023901
2 0.059850 -0.002918
3 0.024291 0.048896
4 0.012255 0.022621
... ... ...
164867 0.036191 0.004222
164868 -0.011482 0.008537
164869 -0.031946 0.002331
164870 0.018912 0.023214
164871 -0.001332 -0.009596
Principal Component 98 Principal Component 99 \
0 0.033501 -0.050333
1 0.021223 -0.001277
2 0.010585 -0.012571
3 0.014030 -0.024417
4 -0.050565 0.023814
... ... ...
164867 0.015632 0.003004
164868 -0.003571 0.017377
164869 0.002713 -0.006098
164870 -0.002607 -0.027681
164871 0.021421 -0.011608
Principal Component 100
0 -0.025701
1 -0.020980
2 0.012114
3 0.027481
4 -0.013738
... ...
164867 0.039442
164868 0.018483
164869 0.015812
164870 -0.043942
164871 -0.003840
[164872 rows x 100 columns]
reduced_tfidf_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164872 entries, 0 to 164871
Data columns (total 100 columns): Principal Component 1 through Principal Component 100, each 164872 non-null float64
dtypes: float64(100)
memory usage: 125.8 MB
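As a quick sanity check on the reduction, the retained variance can be read off the fitted PCA object; the name pca below is an assumption about how the earlier reduction step named it.
# How much of the TF-IDF variance the 100 components retain
# (assumes the fitted PCA object from the reduction step is named `pca`)
import numpy as np
print(f'Variance retained by 100 components: {np.sum(pca.explained_variance_ratio_):.2%}')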
After reducing the dimensionality with PCA, we combine the reduced TF-IDF features with the sentiment features. TF-IDF captures the importance of words in the text, while sentiment scores quantify its emotional tone; combining both types of features enriches the overall feature set.
# Combine TF-IDF features with sentiment scores
combined_features = pd.concat([reduced_tfidf_df, data[['positive sentiment', 'negative sentiment', 'neutral sentiment', 'compound sentiment']]], axis=1)
# Extract the labels
labels = data['generated']
print(combined_features.shape)
print(combined_features.info())
print('\n')
print(labels.shape)
print(labels.info())
(164872, 104)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164872 entries, 0 to 164871
Columns: 104 entries, Principal Component 1 to compound sentiment
dtypes: float64(104)
memory usage: 130.8 MB
None

(164872,)
<class 'pandas.core.series.Series'>
RangeIndex: 164872 entries, 0 to 164871
Series name: generated
Non-Null Count  Dtype
164872 non-null int64
dtypes: int64(1)
memory usage: 1.3 MB
None
70% of the data is used for training the models, 15% is set aside for validation purposes, and the remaining 15% is reserved as the test set. The test set will only be used for evaluating the performance of the best selected model.
# Split the data: 70% training, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(combined_features, labels, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
(X_train.shape, X_temp.shape, y_train.shape, y_temp.shape)
((115410, 104), (49462, 104), (115410,), (49462,))
(X_val.shape, X_test.shape, y_val.shape, y_test.shape)
((24731, 104), (24731, 104), (24731,), (24731,))
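The split above is purely random. Because the classes are imbalanced, a stratified split is a reasonable alternative that keeps the class ratio identical across all three subsets; the variant below is a sketch, not the run recorded here.
# Optional stratified variant of the same 70/15/15 split
X_train_s, X_temp_s, y_train_s, y_temp_s = train_test_split(
    combined_features, labels, test_size=0.3, random_state=42, stratify=labels)
X_val_s, X_test_s, y_val_s, y_test_s = train_test_split(
    X_temp_s, y_temp_s, test_size=0.5, random_state=42, stratify=y_temp_s)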
SMOTE is used to balance the class distribution by generating synthetic samples for the minority class, which helps in training more robust and unbiased models. SMOTE creates new instances by interpolating between existing minority-class instances and their nearest neighbours, as sketched below.
We apply SMOTE only to the training set, not to the validation or test sets, because those sets are meant to simulate real-world data. Resampling them would alter their natural class distribution and lead to an unrealistically optimistic evaluation of model performance.
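To make the interpolation step concrete, here is a minimal sketch on a toy 2-D minority class; the data and names are illustrative only, and the real imblearn implementation additionally performs k-nearest-neighbour search and controls the sampling ratio.
import numpy as np

rng = np.random.default_rng(42)
X_minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]])  # toy minority-class points

# Pick a sample and one of its neighbours, then interpolate between them
i, j = 0, 1
lam = rng.random()  # lambda drawn uniformly from [0, 1)
x_new = X_minority[i] + lam * (X_minority[j] - X_minority[i])
print(x_new)  # a synthetic point on the segment between points i and j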
# Apply SMOTE to the training data
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
X_train_smote.shape, y_train_smote.shape
((163602, 104), (163602,))
y_df = pd.DataFrame(y_train_smote)
# Visualize the number of generated and non-generated texts
plt.figure(figsize=(8, 6))
sns.countplot(x='generated', data=y_df)
plt.title('Distribution of Generated vs. Non-Generated Texts')
plt.xlabel('Generated (0 = Human, 1 = AI)')
plt.ylabel('Number of Essays')
plt.show()
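As a numeric counterpart to the plot, the class counts before and after resampling can be printed directly; both variables already exist above.
# Class balance before and after SMOTE
print('Before SMOTE:\n', y_train.value_counts())
print('After SMOTE:\n', y_train_smote.value_counts())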
In selecting models for our text classification task, we focused on a diverse set of algorithms that bring various strengths and capabilities to the problem. Here are the models we chose and the reasons for selecting each:
Logistic Regression - a staple in classification tasks and a strong text-classification baseline: it is fast to train, scales well to many features, and is easy to interpret through its coefficients.
Random Forest - an ensemble method that aggregates many decision trees, providing robustness to noise, resistance to overfitting, and built-in feature importance estimates.
Support Vector Machine (SVM) - highly effective in high-dimensional spaces, which is common in text data, by maximizing the margin between classes.
Gradient Boosting - builds models sequentially so that each new tree corrects the errors of the previous ones, often yielding strong accuracy on tabular features.
Neural Network - learns complex, non-linear patterns directly from the data, complementing the linear and tree-based models above.
Using this diverse set of models allows us to leverage their individual strengths to effectively deal with the text classification task. This approach enables us to compare performance across different methods and select the most suitable model for our specific needs.
# Initialize the models
models = {
'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Support Vector Machine': SVC(kernel='linear', random_state=42),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}
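One configuration detail worth noting: as initialized above, the linear SVC exposes only decision_function. If calibrated class probabilities are needed instead, it must be fitted with probability=True, which adds an internal cross-validated calibration step and noticeably increases training time. This variant is shown for reference only and is not used in the runs below.
# Optional: SVC with probability estimates enabled (slower to train)
svm_with_proba = SVC(kernel='linear', probability=True, random_state=42)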
for model_name, model in models.items():
    print(f'Training {model_name}...')
    # Train the model on the resampled training data
    model.fit(X_train_smote, y_train_smote)
Training Logistic Regression... Training Random Forest... Training Support Vector Machine... Training Gradient Boosting...
conf_matrices = []
class_reports = {}
model_names = []
results = {}
for model_name, model in models.items():
    # Evaluate on the validation set
    y_val_pred = model.predict(X_val)
    cls_rpt = classification_report(y_val, y_val_pred)
    class_reports[model_name] = cls_rpt
    # Store the confusion matrix
    conf_matrix = confusion_matrix(y_val, y_val_pred)
    conf_matrices.append(conf_matrix)
    model_names.append(model_name)
    val_accuracy = accuracy_score(y_val, y_val_pred)
    val_precision = precision_score(y_val, y_val_pred)
    val_recall = recall_score(y_val, y_val_pred)
    val_f1 = f1_score(y_val, y_val_pred)
    results[model_name] = {
        "accuracy": val_accuracy,
        "precision": val_precision,
        "recall": val_recall,
        "f1": val_f1
    }
# Encode the labels as integers for the Keras model
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train_smote)  # SMOTE-resampled training labels
y_val_encoded = label_encoder.transform(y_val)
y_test_encoded = label_encoder.transform(y_test)
nn_model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid')  # sigmoid output for binary classification
])
# Compile the model
nn_model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])
nn_model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ dense (Dense)       │ (None, 128)  │  13,440 │
│ dropout (Dropout)   │ (None, 128)  │       0 │
│ dense_1 (Dense)     │ (None, 64)   │   8,256 │
│ dropout_1 (Dropout) │ (None, 64)   │       0 │
│ dense_2 (Dense)     │ (None, 1)    │      65 │
└─────────────────────┴──────────────┴─────────┘
Total params: 21,761 (85.00 KB)
Trainable params: 21,761 (85.00 KB)
Non-trainable params: 0 (0.00 B)
nn_model.fit(X_train_smote, y_train_encoded, epochs=10, batch_size=32, validation_data=(X_val, y_val_encoded), verbose=1)
Epoch 1/10  5113/5113 [==============================] - 19s 3ms/step - loss: 0.0706 - accuracy: 0.9757 - val_loss: 0.0373 - val_accuracy: 0.9878
Epoch 2/10  5113/5113 [==============================] - 17s 3ms/step - loss: 0.0352 - accuracy: 0.9885 - val_loss: 0.0314 - val_accuracy: 0.9902
Epoch 3/10  5113/5113 [==============================] - 15s 3ms/step - loss: 0.0273 - accuracy: 0.9911 - val_loss: 0.0253 - val_accuracy: 0.9921
Epoch 4/10  5113/5113 [==============================] - 15s 3ms/step - loss: 0.0243 - accuracy: 0.9920 - val_loss: 0.0227 - val_accuracy: 0.9928
Epoch 5/10  5113/5113 [==============================] - 15s 3ms/step - loss: 0.0213 - accuracy: 0.9928 - val_loss: 0.0219 - val_accuracy: 0.9937
Epoch 6/10  5113/5113 [==============================] - 15s 3ms/step - loss: 0.0193 - accuracy: 0.9939 - val_loss: 0.0194 - val_accuracy: 0.9944
Epoch 7/10  5113/5113 [==============================] - 15s 3ms/step - loss: 0.0172 - accuracy: 0.9941 - val_loss: 0.0195 - val_accuracy: 0.9945
Epoch 8/10  5113/5113 [==============================] - 15s 3ms/step - loss: 0.0165 - accuracy: 0.9943 - val_loss: 0.0177 - val_accuracy: 0.9943
Epoch 9/10  5113/5113 [==============================] - 15s 3ms/step - loss: 0.0158 - accuracy: 0.9945 - val_loss: 0.0190 - val_accuracy: 0.9941
Epoch 10/10 5113/5113 [==============================] - 16s 3ms/step - loss: 0.0148 - accuracy: 0.9948 - val_loss: 0.0179 - val_accuracy: 0.9948
<keras.src.callbacks.History at 0x7d5292c4bac0>
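Ten epochs were sufficient here, as the validation loss was still flattening out rather than rising. For longer runs, an early-stopping callback is a common safeguard; the sketch below is an alternative setup, not the training run recorded above.
# Alternative training setup with early stopping (restores the best weights
# once validation loss stops improving for `patience` consecutive epochs)
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
history = nn_model.fit(X_train_smote, y_train_encoded,
                       epochs=50, batch_size=32,
                       validation_data=(X_val, y_val_encoded),
                       callbacks=[early_stop], verbose=1)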
Make predictions on the Validation Set:
nn_y_val_pred_prob = nn_model.predict(X_val)
nn_y_val_pred = (nn_y_val_pred_prob > 0.5).astype(int) # Convert probabilities to binary predictions
773/773 [==============================] - 2s 2ms/step
# Evaluate the Neural Network
accuracy_nn = accuracy_score(y_val_encoded, nn_y_val_pred)
precision_nn = precision_score(y_val_encoded, nn_y_val_pred)
recall_nn = recall_score(y_val_encoded, nn_y_val_pred)
f1_nn = f1_score(y_val_encoded, nn_y_val_pred)
roc_auc_nn = roc_auc_score(y_val_encoded, nn_y_val_pred_prob)  # use probabilities, not thresholded labels
results["Neural Network"] = {
"accuracy": accuracy_nn,
"precision": precision_nn,
"recall": recall_nn,
"f1": f1_nn
}
Confusion Matrices and Classification Reports
labels = ['Human-Written', 'AI Generated']
# Visualize the confusion matrices
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
for i, (model_name, conf_matrix) in enumerate(zip(model_names, conf_matrices)):
    # Print classification report
    print(f'Accuracy of {model_name}: {results[model_name]["accuracy"]:.4f}')
    print(class_reports[model_name])
    print('\n' + '='*60 + '\n')
    ax = axes[i//2, i%2]
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', ax=ax, xticklabels=labels, yticklabels=labels)
    ax.set_title(f'Confusion Matrix for {model_name}')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('True')
plt.tight_layout()
plt.show()
Accuracy of Logistic Regression: 0.9755
precision recall f1-score support
0 0.99 0.98 0.98 17530
1 0.95 0.97 0.96 7201
accuracy 0.98 24731
macro avg 0.97 0.97 0.97 24731
weighted avg 0.98 0.98 0.98 24731
============================================================
Accuracy of Random Forest: 0.9968
precision recall f1-score support
0 1.00 1.00 1.00 17530
1 1.00 0.99 0.99 7201
accuracy 1.00 24731
macro avg 1.00 1.00 1.00 24731
weighted avg 1.00 1.00 1.00 24731
============================================================
Accuracy of Support Vector Machine: 0.9765
precision recall f1-score support
0 0.99 0.98 0.98 17530
1 0.95 0.97 0.96 7201
accuracy 0.98 24731
macro avg 0.97 0.98 0.97 24731
weighted avg 0.98 0.98 0.98 24731
============================================================
Accuracy of Gradient Boosting: 0.9682
precision recall f1-score support
0 0.98 0.97 0.98 17530
1 0.93 0.96 0.95 7201
accuracy 0.97 24731
macro avg 0.96 0.97 0.96 24731
weighted avg 0.97 0.97 0.97 24731
============================================================
Confusion Matrix and Classification Report
# Print classification report
print(f'Accuracy of Neural Network: {accuracy_nn:.4f}')
cr = classification_report(y_val_encoded, nn_y_val_pred)
print('Classification Report:\n', cr)
nn_val_conf_matrix = confusion_matrix(y_val_encoded, nn_y_val_pred)
labels = ['Human-Written', 'AI Generated']
# Plotting the confusion matrix as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(nn_val_conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
plt.title('Confusion Matrix - Neural Network on Validation Data')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
Accuracy of Neural Network: 0.9948
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 17530
1 0.99 0.99 0.99 7201
accuracy 0.99 24731
macro avg 0.99 0.99 0.99 24731
weighted avg 0.99 0.99 0.99 24731
# Convert results to DataFrame
results_df = pd.DataFrame(results).T
print('Model Performance (Validation Set):')
print(results_df)
# ROC curves on the validation set
plt.figure(figsize=(8, 6))
for model_name, model in models.items():
    if hasattr(model, "predict_proba"):
        y_pred_prob = model.predict_proba(X_val)[:, 1]
    else:
        # SVM without probability=True exposes decision_function instead of predict_proba
        y_pred_prob = model.decision_function(X_val)
    fpr, tpr, _ = roc_curve(y_val_encoded, y_pred_prob)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=2, label=f'{model_name} (AUC = {roc_auc:.4f})')
# Add the ROC curve for the Neural Network, built from predicted probabilities
fpr_nn, tpr_nn, _ = roc_curve(y_val_encoded, nn_y_val_pred_prob)
plt.plot(fpr_nn, tpr_nn, lw=2, label=f'Neural Network (AUC = {roc_auc_nn:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve (Validation Set)')
plt.legend(loc="lower right")
plt.show()
Model Performance (Validation Set):
accuracy precision recall f1
Logistic Regression 0.975496 0.945909 0.971393 0.958482
Random Forest 0.996806 0.997069 0.991946 0.994501
Support Vector Machine 0.976507 0.948753 0.971809 0.960143
Gradient Boosting 0.968218 0.931754 0.961255 0.946275
Neural Network 0.994784 0.992068 0.990001 0.991033
Confusion Matrices:
The Random Forest and Neural Network models show the best performance with the highest true positives and true negatives and the lowest false positives and false negatives.
Support Vector Machine and Logistic Regression also perform well, while Gradient Boosting has slightly higher false positives and false negatives compared to the other models.
Performance Metrics:
Random Forest and Neural Network models show the highest performance across all metrics.
The Logistic Regression model is also highly efficient: it performs well and trains very quickly, as does the Neural Network.
The Support Vector Machine and Gradient Boosting models perform well but have longer running times and slightly lower metrics compared to the top performers.
ROC Curves:
The ROC curves for all models show excellent performance with areas under the curve (AUC) very close to 1. The Random Forest model has the highest AUC (0.9999), indicating almost perfect classification ability, followed closely by the Neural Network (0.9943).
Choosing the Best Models:
Based on the analysis of the confusion matrices, performance metrics, and running times, we would choose to use the Neural Network and Random Forest models for our test evaluation.
best_models = {
'Neural Network': nn_model,
'Random Forest': models['Random Forest']
}
# Plot feature importance
def plot_feature_importance(model, model_name):
    if model_name == 'Logistic Regression':
        feature_importance = abs(model.coef_[0])
    elif model_name in ('Random Forest', 'Gradient Boosting'):
        feature_importance = model.feature_importances_
    elif model_name == 'Support Vector Machine' and model.kernel == 'linear':
        feature_importance = abs(model.coef_[0])
    else:
        return  # SVM with a non-linear kernel has no straightforward feature importance
    # Sort feature importances in descending order
    indices = np.argsort(feature_importance)[::-1]
    # Select the top 20 features
    top_indices = indices[:20]
    top_features = combined_features.columns[top_indices]
    top_importance = feature_importance[top_indices]
    plt.figure(figsize=(12, 8))
    sns.barplot(x=top_importance, y=top_features)
    plt.title(f'Feature Importance for {model_name}')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.show()
# Plot feature importance for each model
for model_name, model in models.items():
plot_feature_importance(model, model_name)
The distribution of feature importance shows that both the principal components of the TF-IDF space and the sentiment features contribute to the models' predictions, supporting the decision to combine the two feature types.
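The plot function above skips the Neural Network, which exposes neither coef_ nor feature_importances_. Permutation importance is a model-agnostic alternative; the sketch below applies it to the Random Forest on a validation subsample to keep the runtime manageable, and the subsample size is an arbitrary choice.
# Model-agnostic importance via permutation on a validation subsample
import numpy as np
from sklearn.inspection import permutation_importance

perm = permutation_importance(models['Random Forest'], X_val[:2000], y_val[:2000],
                              n_repeats=5, random_state=42)
top = np.argsort(perm.importances_mean)[::-1][:10]
for idx in top:
    print(f'{combined_features.columns[idx]}: {perm.importances_mean[idx]:.4f}')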
test_results = {}
model = best_models['Random Forest']
model_name = 'Random Forest'
# Evaluate on the test set
rf_y_test_pred = model.predict(X_test)
rf_test_accuracy = accuracy_score(y_test, rf_y_test_pred)
rf_test_precision = precision_score(y_test, rf_y_test_pred)
rf_test_recall = recall_score(y_test, rf_y_test_pred)
rf_test_f1 = f1_score(y_test, rf_y_test_pred)
# Store the confusion matrix
rf_test_conf_matrix = confusion_matrix(y_test, rf_y_test_pred)
test_results[model_name] = {
"accuracy": rf_test_accuracy,
"precision": rf_test_precision,
"recall": rf_test_recall,
"f1": rf_test_f1
}
Make predictions on the Test Set:
# Predict probabilities for the test set
nn_y_test_pred_prob = nn_model.predict(X_test)
# Convert probabilities to binary predictions (0 or 1)
nn_y_test_pred = (nn_y_test_pred_prob > 0.5).astype(int)
773/773 [==============================] - 2s 2ms/step
# Evaluate Neural Network
accuracy_nn_t = accuracy_score(y_test_encoded, nn_y_test_pred)
precision_nn_t = precision_score(y_test_encoded, nn_y_test_pred)
recall_nn_t = recall_score(y_test_encoded, nn_y_test_pred)
f1_nn_t = f1_score(y_test_encoded, nn_y_test_pred)
roc_auc_nn_t = roc_auc_score(y_test_encoded, nn_y_test_pred_prob)  # use probabilities, not thresholded labels
test_loss, test_accuracy = nn_model.evaluate(X_test, y_test_encoded)
# Calculate the confusion matrix for the Neural Network model
nn_test_conf_matrix = confusion_matrix(y_test_encoded, nn_y_test_pred)
test_results["Neural Network"] = {
"accuracy": accuracy_nn_t,
"precision": precision_nn_t,
"recall": recall_nn_t,
"f1": f1_nn_t
}
773/773 [==============================] - 1s 2ms/step - loss: 0.0170 - accuracy: 0.9953
# Print classification report
print(f'Accuracy of Random Forest: {rf_test_accuracy:.4f}')
print(classification_report(y_test, rf_y_test_pred))
print('\n' + '='*60 + '\n')
print(f'Accuracy of Neural Network: {accuracy_nn_t:.4f}')
print("Classification Report:")
print(classification_report(y_test_encoded, nn_y_test_pred))
Accuracy of Random Forest: 0.9966
precision recall f1-score support
0 1.00 1.00 1.00 17416
1 1.00 0.99 0.99 7315
accuracy 1.00 24731
macro avg 1.00 0.99 1.00 24731
weighted avg 1.00 1.00 1.00 24731
============================================================
Accuracy of Neural Network: 0.9953
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 17416
1 0.99 0.99 0.99 7315
accuracy 1.00 24731
macro avg 0.99 0.99 0.99 24731
weighted avg 1.00 1.00 1.00 24731
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Plotting the confusion matrix for the Random Forest model
sns.heatmap(rf_test_conf_matrix, annot=True, fmt='d', cmap='Blues', ax=axes[0], xticklabels=labels, yticklabels=labels)
axes[0].set_title('Confusion Matrix - Random Forest on Test Data')
axes[0].set_xlabel('Predicted Label')
axes[0].set_ylabel('True Label')
# Plotting the confusion matrix for the Neural Network model
sns.heatmap(nn_test_conf_matrix, annot=True, fmt='d', cmap='Blues', ax=axes[1], xticklabels=labels, yticklabels=labels)
axes[1].set_title('Confusion Matrix - Neural Network on Test Data')
axes[1].set_xlabel('Predicted Label')
axes[1].set_ylabel('True Label')
# Adjust layout for better visualization
plt.tight_layout()
plt.show()
test_results_df = pd.DataFrame(test_results).T
fig, ax = plt.subplots(figsize=(6, 6))
ax.axis('tight')
ax.axis('off')
table = ax.table(cellText=test_results_df.values,
colLabels=test_results_df.columns,
rowLabels=test_results_df.index,
cellLoc='center',
loc='center',
colColours=['#f2f2f2']*len(test_results_df.columns))
# Adjust table font size
table.auto_set_font_size(False)
table.set_fontsize(20)
table.scale(3, 4) # adjust scale for better readability
plt.show()
# Plot ROC curve for Random Forest
model = best_models['Random Forest']
if hasattr(model, "predict_proba"):
y_pred_prob = model.predict_proba(X_test)[:, 1]
else:
y_pred_prob = model.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test_encoded, y_pred_prob)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=2, label=f'Random Forest (AUC = {roc_auc:.4f})')
# Plot ROC curve for Neural Network
fpr_nn, tpr_nn, _ = roc_curve(y_test_encoded, nn_y_test_pred_prob)
roc_auc_nn = auc(fpr_nn, tpr_nn)
plt.plot(fpr_nn, tpr_nn, lw=2, label=f'Neural Network (AUC = {roc_auc_nn:.4f})')
# Plot the diagonal line for reference
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
# Customize the plot
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve (Test Set)')
plt.legend(loc="lower right")
# Show the plot
plt.tight_layout()
plt.show()
Random Forest: 0.9966 accuracy on the test set, with the edge in precision, confirming its strong validation results.
Neural Network: 0.9953 accuracy on the test set, with the edge in recall on AI-generated texts.
Conclusions:
In summary, both models perform exceptionally well, with Random Forest having a slight edge in precision and Neural Network having a slight edge in recall. The choice between them should be based on the specific needs of the application, whether the focus is on minimizing false positives or false negatives.
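If an application demands shifting that balance, the network's decision threshold is a direct lever; the sketch below reuses the test-set probabilities computed earlier to show how precision and recall trade off as the 0.5 cut-off moves.
# Precision/recall at different decision thresholds for the Neural Network
from sklearn.metrics import precision_score, recall_score

for thresh in (0.3, 0.5, 0.7):
    preds = (nn_y_test_pred_prob > thresh).astype(int)
    p = precision_score(y_test_encoded, preds)
    r = recall_score(y_test_encoded, preds)
    print(f'threshold={thresh:.1f}  precision={p:.4f}  recall={r:.4f}')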
We successfully completed the project, meeting the requirements of the Kaggle task and achieving high classification accuracy.
The EDA was particularly interesting, yielding valuable insights that informed our approach and contributed to the overall success of the project.
The Random Forest and Neural Network models were identified as the best performing models, demonstrating their capability to accurately differentiate between human-written and AI-generated texts.